Sports car are commonly seen as powerful, fast, luxurious. It is usually considered as high-end vehicles. According to an article written by Mckinsey & Company in 2022, the sales of luxury vehicle have surpassed the mass market. In addition to traditional comfort and safety features, customers are increasingly seeking sports cars with enhanced functionality, such as improved engines, acceleration time, and horsepower, among other features.
In this project, the focus is on analyzing the Sports Car Prices dataset using the statistical software RStudio. The aim is to conduct in-depth linear regression analyses to identify the most suitable model for predicting the price of sports cars based on their functionality and technical specifications.
By developing an accurate regression model, car manufacturers, dealerships, and buyers can benefit from estimating the value of sports cars based on their specific features and characteristics. Furthermore, this study provides valuable insights into the factors that contribute to the pricing of sports cars, shedding light on how these factors impact the overall market value of these vehicles.
The Sports Car Prices Dataset used in this study was obtained from the Kaggle dataset repository. The dataset provides comprehensive information on various attributes of sports cars, including their prices. The dataset used in this analysis is publicly accessible and can be found at the following URL: https://www.kaggle.com/datasets/rkiattisak/sports-car-prices-dataset. It offers a valuable resource for investigating the factors influencing sports car prices and conducting regression analysis to gain insights into the relationships between these factors and the pricing of sports cars.
The data set we analyzes in this report contains information on sports cars and their prices. The data is collected in 2023 and has 1,007 observations. Each observation contains information about the prices and 7 features of various sports cars. The response variable in this dataset is the price of the car in USD, while the regressor variables include year of production, engine size in liters, horsepower, torque in pound-feet and the time in second it takes for the sports car to accelerate from 0 to 60 miles per hour. Significance level is set to 0.05 (5%).
There is 1 target variable and 7 regressor variables. For the 7 regressor variables, there are 4 numerical and 3 categorical variables.
Target feature:
Descriptive feature:
Numerical
Engine Size (L): The size of the sports car’s engine in liters, which represents the volume of the engine’s cylinders. It ranges from 1.5L to 8.4L.
Horsepower: The horsepower of the sports car, which represents the power output of the car’s engine. It ranges from 181 to 1,600. Some cars use electric engines.
Torque (lb-ft): The torque of the sports car in pound-feet, which represents the rotational force generated by the engine. It ranges from 151 to 1,300.
0-60 MPH Time (seconds): The time it takes for the sports car to accelerate from 0 to 60 miles per hour, which is a common measure of acceleration and performance. It ranges from 2.1 to 6.5 seconds.
Categorical
Car Make: The manufacturer of the sports car, which represents the brand or company that produced the car.
Car Model: The model of the sports car, which represents the specific version or variant of the car produced by the manufacturer.
Year: The year of production of the sports car, which indicates the model year when the car was first introduced or made available for purchase.
For this project, we will split our data set to training and test data sets with a ratio of 7:3, meaning that 70% of our observations is the training set and 30% is the testing set. We will apply a few feature selection methods in order to find the best features for prediction. Then, we will create a regression model using training set and evaluate its performance using the test set.
The following assumptions will be tested and verified:
In addition, we will use the selected regression model to make predictions of sports car prices and assess the model’s performance.
We will firstly clean and preprocess the data before analyzing.
library(leaps)
library(ggplot2)
library(GGally)
library(olsrr)
library(car)
library(stats)
library(tidyverse)
library(TSA)
library(QuantPsyc)
library(lmtest)
library(forecast)
The customized function ‘MASE’ calculates the Mean absolute scaled error, a measure of the accuracy of the prediction.
# Mean absolute scaled error
MASE = function(observed , fitted ){
# observed: Observed series on the forecast period
# fitted: Forecast values by your model
Y.t = observed
n = length(fitted)
e.t = Y.t - fitted
sum = 0
for (i in 2:n){
sum = sum + abs(Y.t[i] - Y.t[i-1] )
}
q.t = e.t / (sum/(n-1))
MASE = data.frame( MASE = mean(abs(q.t)))
return(list(MASE = MASE))
}
We observed that there are missing values called “N/A” and “-” in the dataset, so we specify the character strings to be treated as missing values.
# Import data set using read.csv
sportscar <- read.csv("Sportcarprice.csv", na.strings = c("N/A", "-"))
# Check the first 6 rows of the data
head(sportscar)
In this project, the focus of the analysis is on understanding the relationship between the functionality and technical specifications of sports cars and their prices. By excluding the car brand and model as a predictor, the analysis aims to explore the influence of specific features and characteristics on pricing, regardless of the brand.
This approach allows for a more direct examination of how features like engine size, acceleration time, and horsepower etc impact the pricing of sports cars, providing actionable insights for manufacturers, dealerships, and buyers. Therefore, we will drop the Car.make and Car.model columns.
sportscar_2 <- sportscar[, !(names(sportscar) %in% c("Car.Make", "Car.Model"))]
head(sportscar_2)
Data set is imported. Now we will check if there is any missing value using is.na().
# Check for missing values in all columns
sapply(sportscar_2, function(x) sum(is.na(x)))
## Year Engine.Size..L. Horsepower
## 0 11 0
## Torque..lb.ft. X0.60.MPH.Time..seconds. Price..in.USD.
## 4 0 0
It is found that there are 11 missing values in Engine Size and 4 missing value in Torque column. We will omit this value since we are in situations where the missing data (which is related to price) is unlikely to be imputed accurately.
# Omit missing values
data_clean <- na.omit(sportscar_2)
# Check if missing values are omitted
sapply(data_clean, function(x) sum(is.na(x)))
## Year Engine.Size..L. Horsepower
## 0 0 0
## Torque..lb.ft. X0.60.MPH.Time..seconds. Price..in.USD.
## 0 0 0
str() function is used to check the data type of all columns in a dataset. It will print a summary of the structure of the Sports Car dataset, including the data type of each column.
str(data_clean)
## 'data.frame': 994 obs. of 6 variables:
## $ Year : int 2022 2021 2022 2022 2021 2022 2021 2021 2022 2021 ...
## $ Engine.Size..L. : chr "3" "5.2" "3.9" "5.2" ...
## $ Horsepower : chr "379" "630" "661" "562" ...
## $ Torque..lb.ft. : chr "331" "443" "561" "406" ...
## $ X0.60.MPH.Time..seconds.: chr "4" "2.8" "3" "3.2" ...
## $ Price..in.USD. : chr "101,200" "274,390" "333,750" "142,700" ...
## - attr(*, "na.action")= 'omit' Named int [1:13] 169 172 223 248 336 388 390 643 687 698 ...
## ..- attr(*, "names")= chr [1:13] "169" "172" "223" "248" ...
After removing all missing values, we have 994 observations. The data types are all correct except “Engine Size”. It should be numeric but it is identified by R as “character”. Therefore, we check the unique values in Engine Size column to see if there is any character value.
# Print unique values in the "Engine Size" column of the data frame
unique(data_clean$Engine.Size..L.)
## [1] "3" "5.2" "3.9"
## [4] "4" "4.4" "6.2"
## [7] "3.8" "8" "5"
## [10] "3.5" "4.7" "2"
## [13] "2.9" "6" "Electric"
## [16] "6.5" "3.7" "Electric Motor"
## [19] "2.5" "1.5 + Electric" "6.8"
## [22] "8.4" "6.6" "7"
## [25] "1.7" "3.3" "6.7"
## [28] "1.8" "Electric (tri-motor)" "5.5"
## [31] "Electric (93 kWh)" "Electric (100 kWh)" "Hybrid (4.0)"
## [34] "4.6" "3.6" "1.5"
## [37] "Hybrid" "5.7" "2.0 (Electric)"
## [40] "4.0 (Hybrid)" "0" "6.4"
## [43] "6.3" "2.3"
It is found that there are other types of character values such as “Electric” and “Hybrid” in “Engine Size”. Since it is not accurate to impute electric or hybrid car engine having a value of 0 or other values to engine size, we will remove these rows with “Electric” and “Hybrid” in this report. Our report is studying only conventional sports car except electric and hybrid cars. If we can find accurate data about engine size of electric and hybrid cars, we can do further studies in the future.
Also, an engine size of “0” is likely not a valid or normal value, as it does not make practical sense for a physical engine to have zero size. It could be an error or missing data.
# Convert from character to numeric
data_clean$Engine.Size..L. <- as.numeric(data_clean$Engine.Size..L.)
## Warning: NAs introduced by coercion
# Remove all NA values
data_clean2 <- data_clean[!is.na(data_clean$Engine.Size..L.), ]
# Mark engine size equals to 0 as NA
data_clean2[data_clean2$Engine.Size..L.==0,] <- NA
# Remove all NA values
data_clean2 <- data_clean2[!is.na(data_clean2$Engine.Size..L.), ]
data_clean2$X0.60.MPH.Time..seconds. <- as.numeric(data_clean2$X0.60.MPH.Time..seconds.)
data_clean2$Horsepower <- as.integer(data_clean2$Horsepower)
## Warning: NAs introduced by coercion
data_clean2$Torque..lb.ft. <- as.integer(data_clean2$Torque..lb.ft.)
## Warning: NAs introduced by coercion
data_clean2$Price..in.USD. <- as.integer(gsub(",", "", data_clean2$Price..in.USD.))
# Check the updated data set
str(data_clean2)
## 'data.frame': 947 obs. of 6 variables:
## $ Year : int 2022 2021 2022 2022 2021 2022 2021 2021 2022 2021 ...
## $ Engine.Size..L. : num 3 5.2 3.9 5.2 4 4.4 4 6.2 5.2 3.8 ...
## $ Horsepower : int 379 630 661 562 710 617 523 490 760 600 ...
## $ Torque..lb.ft. : int 331 443 561 406 568 553 494 465 625 481 ...
## $ X0.60.MPH.Time..seconds.: num 4 2.8 3 3.2 2.7 3.1 3.8 2.8 3.5 2.5 ...
## $ Price..in.USD. : int 101200 274390 333750 142700 298000 130000 118500 59900 81000 212000 ...
## - attr(*, "na.action")= 'omit' Named int [1:13] 169 172 223 248 336 388 390 643 687 698 ...
## ..- attr(*, "names")= chr [1:13] "169" "172" "223" "248" ...
After removing these rows, there are 947 observations.
We convert the categorical “Year” variable to numeric “YearsSinceProduce”. It can provide a more meaningful representation of the data and help capture the age or depreciation of the car. By subtracting the “Year” from the current year, we can calculate the number of years since the car was manufactured.
Transforming the”Year” variable to “YearsSinceProduce” provides a variable that represents the age of the car model, independent of the specific year. This can facilitate the interpretation of regression coefficients by quantifying the number of years elapsed since the car was produced or released. It also ensures that the model remains applicable over time, as the “YearsSinceProduce” variable will consistently represent the number of years elapsed since the car was produced or released regardless of when the analysis is conducted.
# Configure current year
current_year <- as.integer(format(Sys.Date(), "%Y"))
# Calculate car age and add a column
data_clean2$YearsSinceProduce <- current_year - data_clean2$Year
# Remove column "Year"
data_clean2 <- subset(data_clean2, select = -Year)
head(data_clean2)
In this section, we will apply these descriptive statistical techniques to our dataset, focusing on summary statistics, histograms, and correlation matrices. By examining these measures and visualizations, we can uncover valuable insights and patterns, which will enhance our understanding of the data and lay the groundwork for further analysis and modeling.
# Print the summary statistics using the summary function
summary(data_clean2)
## Engine.Size..L. Horsepower Torque..lb.ft. X0.60.MPH.Time..seconds.
## Min. :1.5 Min. : 181.0 Min. : 151.0 Min. :2.100
## 1st Qu.:3.5 1st Qu.: 454.0 1st Qu.: 398.0 1st Qu.:3.000
## Median :4.0 Median : 580.0 Median : 505.0 Median :3.500
## Mean :4.4 Mean : 592.4 Mean : 511.7 Mean :3.589
## 3rd Qu.:5.2 3rd Qu.: 660.0 3rd Qu.: 590.0 3rd Qu.:4.000
## Max. :8.4 Max. :1600.0 Max. :1300.0 Max. :6.500
## NA's :1 NA's :1
## Price..in.USD. YearsSinceProduce
## Min. : 25000 Min. : 1.000
## 1st Qu.: 70348 1st Qu.: 2.000
## Median : 120000 Median : 3.000
## Mean : 342920 Mean : 2.823
## 3rd Qu.: 240500 3rd Qu.: 3.000
## Max. :5200000 Max. :59.000
##
For engine size, it ranges from 1.5L to 8.4L, with a median of 4.0L.
For horsepower, it ranges from 181 to 1600, with a median of 580.
For torque, it ranges from 151 to 1300 lb-ft, with a median of 505.
For 0-60 MPH Time (seconds), it is the time it takes for the sports car to accelerate from 0 to 60 mph ranges from 2.1 to 6.5 seconds, with a median value of 3.5 seconds.
For YearsSinceProduce, this variable represents the number of years since the sports car was produced. The minimum value is 1, indicating last year, and the maximum value is 59, with a median of 3.
For car price, the prices of the sports cars range from 25,000 to 5,200,000, with a median value of 120,000. The mean is higher than the median, it generally indicates that the distribution is positively skewed. We can visualize it using histogram.
In this section, we will explore histograms of the numeric variables in our dataset. By examining the shape, spread, and peaks of the histograms, we can gain a deeper understanding of the underlying distribution of each variable. This information can help us uncover important characteristics such as symmetry, skewness, or the presence of outliers. By visualizing the data in this way, we can better comprehend its distributional properties and make informed decisions about further analysis and modeling.
# Plot histogram of price
ggplot(data_clean2, aes(x = Price..in.USD.)) +
geom_histogram(binwidth = 100000, color = "black", fill = "purple") +
labs(title = "Histogram of Sports Car Price", x = "Sports Car Price", y = "Frequency")
From the histogram of the sports car price, we observe a highly skewed distribution towards the right. This indicates that the majority of sports cars in our dataset have relatively lower prices, while there are a few outliers with exceptionally high prices. The presence of these outliers suggests the existence of luxury or high-performance sports cars in our dataset, which significantly influence the overall distribution of prices. By examining the histogram, we can visually identify the extent of the skewness and the concentration of prices within different price ranges. This information will be valuable for understanding the price distribution and identifying any potential patterns or trends that may exist within our data.
Then, we check the histogram for the regressors variables.
# Plot histogram of engine size
ggplot(data_clean2, aes(x = Engine.Size..L.)) +
geom_histogram(binwidth = 0.5, color = "black", fill = "skyblue") +
labs(title = "Histogram of Engine Size", x = "Engine Size", y = "Frequency")
From the histogram of the engine size variable, we can observe a slightly skewed distribution towards the right. This indicates that the majority of sports cars in our dataset have engine sizes that are relatively small, while there are fewer cars with larger engine sizes. The skewness towards the right suggests that there may be a few sports cars with exceptionally large engine sizes, which contribute to the overall shape of the distribution.
# Plot histogram of horsepower
ggplot(data_clean2, aes(x = Horsepower)) +
geom_histogram(binwidth = 30, color = "black", fill = "red") +
labs(title = "Histogram of Horsepower", x = "Horsepower", y = "Frequency")
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).
The histogram of the horsepower variable reveals a right-skewed distribution. This indicates that the majority of sports cars in our dataset have relatively lower horsepower values, while a few cars exhibit exceptionally high horsepower. The presence of outliers with large horsepower values contributes to the elongated tail on the right side of the histogram.
# Plot histogram of torque
ggplot(data_clean2, aes(x = Torque..lb.ft.)) +
geom_histogram(binwidth = 30, color = "black", fill = "lightgreen") +
labs(title = "Histogram of Torque", x = "Torque", y = "Frequency")
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).
Similarly, the histogram of the torque variable exhibits a similar pattern to that of horsepower. It is slightly skewed to the right, indicating that most sports cars in our dataset have relatively lower torque values, while a few cars possess significantly higher torque. The presence of outliers with large torque values contributes to the elongated tail on the right side of the histogram. B
# Plot histogram of MPH time
ggplot(data_clean2, aes(x = X0.60.MPH.Time..seconds.)) +
geom_histogram(binwidth = 0.3, color = "black", fill = "darkred") +
labs(title = "Histogram of MPH Time", x = "MPH Time", y = "Frequency")
The histogram of the 0-60 MPH Time variable reveals a slightly right-skewed distribution. This indicates that the majority of sports cars in our dataset have relatively faster acceleration times, with shorter durations to reach 60 miles per hour.
# Plot histogram of years since car production
ggplot(data_clean2, aes(x = YearsSinceProduce)) +
geom_histogram(binwidth = 1, color = "black", fill = "orange") +
labs(title = "Histogram of Years Since Car Production", x = "Years Since Car Production", y = "Frequency")
The histogram of the Years Since Car Production variable displays a highly right-skewed distribution. This suggests that the majority of sports cars in our dataset were produced relatively new, within the last three years. The concentration of cars in the more recent production years is evident from the higher frequency of observations in the corresponding bins on the right side of the histogram. A clear outlier is present, featuring a notably extended duration of 59 years since the car’s production.
Multicollinearity refers to the situation where two or more predictors in a regression model are highly correlated with each other. In other words, it is the presence of a linear relationship between two or more predictors. We will use correlation matrix of the predictor variables using ggpairs() function. Then we examine the correlation matrix to identify any strongly correlated pairs of predictor variables.
create_ggpairs <- function(data) {
# Select the numeric variables for analysis
numeric_vars <- c("Engine.Size..L.", "Horsepower", "Torque..lb.ft.", "X0.60.MPH.Time..seconds.", "YearsSinceProduce", "Price..in.USD.")
numeric_data <- data[, numeric_vars]
# Create the scatter plot matrix using ggpairs
ggpairs(numeric_data)
}
create_ggpairs(data_clean2)
From the table, the numeric variables are all skewed to the right. Horsepower and Toque are strongly related with a correlation coefficient of \(0.920\). It is a strong indication of multicollinearity. We will confirm multicollinearity by the Variance Inflation Factor (\(VIF\)) later.
As part of our model building strategies, we will firstly fit the full model, then perform residual analysis to see if any data transformation is needed. To conduct a linear regression analysis we have the following assumptions:
We perform full model fitting using linear regression. The objective is to predict the “Price..in.USD.” of sports cars based on various predictor variables. The lm() function is used to fit the model, and the summary() function is applied to obtain a summary of the model, including coefficient estimates and evaluation metrics.
# Full Model Fitting
model_full <- lm(Price..in.USD.~., data=data_clean2)
summary(model_full)
##
## Call:
## lm(formula = Price..in.USD. ~ ., data = data_clean2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1796825 -172150 14313 119850 2955845
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2230065.4 139446.8 -15.992 < 2e-16 ***
## Engine.Size..L. -78287.6 12161.1 -6.438 1.93e-10 ***
## Horsepower 2871.7 160.9 17.848 < 2e-16 ***
## Torque..lb.ft. 492.9 190.5 2.587 0.00983 **
## X0.60.MPH.Time..seconds. 247903.5 26948.5 9.199 < 2e-16 ***
## YearsSinceProduce 25107.1 6396.5 3.925 9.30e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 398600 on 940 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.6643, Adjusted R-squared: 0.6625
## F-statistic: 372 on 5 and 940 DF, p-value: < 2.2e-16
The fitted multiple linear regression model is \(\hat Y= -2230065 - 78287 \hspace{0.1 cm} EngineSize +2871 \hspace{0.1 cm} Horsepower + 492 \hspace{0.1 cm} Torque + 247903 \hspace{0.1 cm} MPH time + 25107 \hspace{0.1 cm} YearsSinceProduce + \varepsilon\) where the independent error terms \(\varepsilon\) is assumed to follow a normal distribution with mean 0 and equal variance. Y is dependent or response variable. \(EngineSize\), \(Horsepower\), \(Torque\), \(MPH time\) and \(YearsSinceProduce\) are independent variables.
Null Hypothesis (\(H_0\)): There is no significant relationship between the predictors and the response variable.
Alternative Hypothesis (\(H_A\)): There is a significant relationship between the predictors and the response variable.
Based on the obtained p-value < 2.2e-16 , which is less than the significance level \(0.05\), we have sufficient evidence to reject the null hypothesis. This indicates that there is a statistically significant relationship between the predictors (independent variables) included in the regression model and the response variable (dependent variable). We can conclude that the overall model is significant.
The adjusted R-squared value is \(0.6625\), meaning that approximately 66.25% of the variability in the response variable is accounted for by the predictor variables in the regression model.
Multicollinearity is a problem that may impact the estimates of the individual regression coefficients. To test multicollinearity, we should also use the Variance Inflation Factor (\(VIF\)). The Variance Inflation Factor (VIF) is a measure of the amount of multicollinearity present in a set of predictor variables. A high VIF value indicates a high correlation between a predictor variable and the other variables in the model.
If \(VIF\) value is greater than 5, it indicates a significant and strong correlation between the predictor variable and the other variables in the model. The regressors that have high VIFs have poorly estimated regression coefficients
vif(lm(Price..in.USD.~., data=data_clean2))
## Engine.Size..L. Horsepower Torque..lb.ft.
## 1.694506 8.036152 6.816369
## X0.60.MPH.Time..seconds. YearsSinceProduce
## 2.250015 1.049548
As we can see, the VIF of Horsepower (8.036152) and Torque (6.816369) are considered significant in multicollinearity. Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated, making it difficult to determine the unique contribution of each variable to the response variable. We will address it during linear regression analysis stage.
Residuals analysis is important for regression models because it helps to check the assumptions of the model and the quality of the fit.
par(mfrow=c(2,2))
plot(model_full)
The Residuals vs Leverage plot helps us find influential cases using Cook’s distance. There are no outlying values at the upper right corner or lower right corner. The spots at the upper right corner or lower right corner are places where cases can be influential against a regression line.
Residuals vs Fitted plot shows if residuals have non-linear patterns. Most residuals are more or less spreading around the horizontal line. It can be an indication that we do not have non-linear relationships, but we will need to check the partial residual plots.
One key assumption of multiple linear regression is that there exists a linear relationship between each predictor variable and the response variable. To check this assumption, we create a partial residual plot, which displays the residuals of one predictor variable against the response variable. We can use crPlots() function to create partial residual plots.
crPlots(model_full)
The dotted blue line shows the expected residuals if the relationship between the predictor and response variable is linear. The pink line shows the actual residuals. If the two lines differ, it indicates evidence of non-linear relationship. We observe that the two lines are following each other closely. It is likely that there are no violations of the linearity assumption in the full model.
We check if the residuals have any autocorrelation by ACF and PACF plots.
par(mfrow=c(1,2))
acf(model_full$residuals, main = "The ACF of the residuals")
pacf(model_full$residuals, main = "The PACF of the residuals")
The presence of 1-2 significant lags in the ACF and PACF plots of residuals implies that there might be some remaining temporal patterns or dependencies in the model residuals that are not adequately captured by the current model.
To check auto-correlation statistically, we use Durbin-Watson test. The Durbin-Watson test is a statistical test used to check for the presence of autocorrelation in the residuals of a regression model. It examines whether there is a systematic pattern of correlation between consecutive residuals.
Null Hypothesis (\(H_0\)): There is no autocorrelation present in the residuals.
Alternative Hypothesis (\(H_A\)): There is autocorrelation present in the residuals.
durbinWatsonTest(model_full)
## lag Autocorrelation D-W Statistic p-value
## 1 0.1077484 1.759247 0.002
## Alternative hypothesis: rho != 0
Based on the Durbin-Watson test result, the p-value is 0.002. Since the p-value is less than \(0.05\), we have sufficient evidence to reject the null hypothesis. This suggests that there is significant autocorrelation present in the residuals of the model.
The QQ plot is a valuable tool for assessing the normality of a dataset. It compares the observed quantiles of the data to the expected quantiles of a normal distribution. If the points on the plot closely follow a straight line, it suggests that the data follows a normal distribution. Deviations from the straight line indicate departures from normality, such as skewness or heavy tails.
qqnorm(model_full$residuals, xlab="Normal Scores")
qqline(model_full$residuals)
We observe deviated right tails in the QQ plot. To test normality statistically, we can use the Shapiro-Wilk test.
Null Hypothesis (\(H_0\)): The residuals of the full model are normally distributed.
Alternative Hypothesis (\(H_A\)): The residuals of the full model are not normally distributed.
shapiro.test(model_full$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_full$residuals
## W = 0.72639, p-value < 2.2e-16
Based on the Shapiro-Wilk test result, where the p-value is less than \(0.05\), we reject the null hypothesis of normality for the residuals of the full model. Since the p-value is less than \(0.05\), we have sufficient evidence to reject the null hypothesis. This suggests that the residuals of the full model do not follow a normal distribution.
The ncvTest, or non-constant variance test, is used to assess whether there is evidence of heteroscedasticity in a regression model. It examines the relationship between the residuals and the fitted values of the model. A significant result from the ncvTest indicates the presence of heteroscedasticity, suggesting that the variance of the errors is not constant across different levels of the predictors.
Null Hypothesis (\(H_0\)): The variance of the residuals is constant (homoscedasticity).
Alternative Hypothesis (\(H_A\)): The variance of the residuals is not constant (heteroscedasticity).
ncvTest(model_full)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 498.4595, Df = 1, p = < 2.22e-16
Since the p-value is less than \(0.05\), we have strong evidence to reject the null hypothesis. This suggests that there is significant evidence of heteroscedasticity in the residuals of the model. To deal with heteroscedasticity, we can apply data transformation.
As we find violations of normality and heteroscedasticity, we consider data transformations (e.g. logarithmic and Box-Cox transformation) to address non-normality. As the data is highly right skewed observed from Histogram of Sports Car Price, we will firstly consider logarithmic transformation.
# Take log of Price
lm.log <- lm(log(Price..in.USD.) ~ YearsSinceProduce + Horsepower + Engine.Size..L. + X0.60.MPH.Time..seconds., data = data_clean2)
summary(lm.log)
##
## Call:
## lm(formula = log(Price..in.USD.) ~ YearsSinceProduce + Horsepower +
## Engine.Size..L. + X0.60.MPH.Time..seconds., data = data_clean2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.94322 -0.34382 0.02803 0.27564 2.46780
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.2521751 0.1931530 58.255 < 2e-16 ***
## YearsSinceProduce 0.0648458 0.0088729 7.308 5.77e-13 ***
## Horsepower 0.0035026 0.0001368 25.600 < 2e-16 ***
## Engine.Size..L. -0.1217822 0.0165072 -7.378 3.54e-13 ***
## X0.60.MPH.Time..seconds. -0.2826933 0.0374518 -7.548 1.04e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.554 on 941 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.7185, Adjusted R-squared: 0.7173
## F-statistic: 600.6 on 4 and 941 DF, p-value: < 2.2e-16
The fitted multiple linear regression model is \(\hat Y = 11.2521751 - 0.1217822 \hspace{0.1 cm} EngineSize +0.0035026 \hspace{0.1 cm} Horsepower -0.2826933 \hspace{0.1 cm} MPH time + 0.0648458 \hspace{0.1 cm} YearsSinceProduce + \varepsilon\) where the independent error terms \(\varepsilon\) is assumed to follow a normal distribution with mean 0 and equal variance. Y is dependent or response variable. \(EngineSize\), \(Horsepower\), \(MPH time\) and \(YearsSinceProduce\) are independent variables. All variables are significant.
# Plot histogram of log Price
ggplot(data_clean2, aes(x = log(Price..in.USD.) )) +
geom_histogram(binwidth = 0.1, color = "purple", fill = "purple") +
labs(title = "Histogram of Log-Transformed Sports Car Price", x = "Log-Transformed Sports Car Price", y = "Frequency")
After log transformation, the distribution is obviously less right skewed when compared to the original histogram of Sports Car Price, but it is still positively skewed.
# Residual Plots
par(mfrow=c(2,2))
plot(lm.log)
The Residuals vs Leverage plot helps us find influential cases using Cook’s distance. A point in the right bottom corner of the Residuals vs Leverage plot suggests that the corresponding observation has a notable impact on the model, potentially influencing the regression coefficients.
Residuals vs Fitted plot shows if residuals have non-linear patterns. Most residuals are more or less spreading around the horizontal line. It can be an indication that we do not have non-linear relationships, but we will need to check the partial residual plots.
One key assumption of multiple linear regression is that there exists a linear relationship between each predictor variable and the response variable. To check this assumption, we create a partial residual plot, which displays the residuals of one predictor variable against the response variable. We can use crPlots() function to create partial residual plots.
crPlots(lm.log)
The dotted blue line shows the expected residuals if the relationship between the predictor and response variable is linear. The pink line shows the actual residuals. If the two lines differ, it indicates evidence of non-linear relationship. We observe that the two lines are pretty close except YearsSinceProduce.
We check if the residuals have any autocorrelation by ACF and PACF plots.
par(mfrow=c(1,2))
acf(lm.log$residuals, main = "The ACF of the residuals")
pacf(lm.log$residuals, main = "The PACF of the residuals")
From ACF and PACF plot, there are still 2-3 significant lags. Auto-correlation exists. To check auto-correlation statistically, we use Durbin-Watson test. It examines whether there is a systematic pattern of correlation between consecutive residuals.
Null Hypothesis (\(H_0\)): There is no autocorrelation present in the residuals.
Alternative Hypothesis (\(H_A\)): There is autocorrelation present in the residuals.
durbinWatsonTest(lm.log)
## lag Autocorrelation D-W Statistic p-value
## 1 0.04965602 1.883178 0.062
## Alternative hypothesis: rho != 0
Based on the Durbin-Watson test result, the p-value is \(0.07\). Since the p-value is larger than \(0.05\), we do not have sufficient evidence to reject the null hypothesis. This suggests that there is no autocorrelation present in the residuals of the model.
The QQ plot is a valuable tool for assessing the normality of a dataset. It compares the observed quantiles of the data to the expected quantiles of a normal distribution. If the points on the plot closely follow a straight line, it suggests that the data follows a normal distribution.
qqnorm(lm.log$residuals, xlab="Normal Scores")
qqline(lm.log$residuals)
We still observe many points deviate from the straight line in the QQ plot. It suggests that the data does not follow a normal distribution. To test normality statistically, we can use the Shapiro-Wilk test.
Null Hypothesis (\(H_0\)): Errors are normally distributed.
Alternative Hypothesis (\(H_A\)): Errors are not normally distributed.
shapiro.test(lm.log$residuals)
##
## Shapiro-Wilk normality test
##
## data: lm.log$residuals
## W = 0.94882, p-value < 2.2e-16
Unfortunately, the p-value is still very low. Normality has no improvement. We reject the null hypothesis. There is strong evidence that errors are not normally distributed.
The ncvTest, or non-constant variance test, is used to assess whether there is evidence of heteroscedasticity in a regression model. A significant result from the ncvTest indicates the presence of heteroscedasticity, suggesting that the variance of the errors is not constant across different levels of the predictors.
Null Hypothesis (\(H_0\)): The variance of the residuals is constant (homoscedasticity).
Alternative Hypothesis (\(H_A\)): The variance of the residuals is not constant (heteroscedasticity).
ncvTest(lm.log)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 28.03898, Df = 1, p = 1.189e-07
Since the p-value is less than \(0.05\), we have strong evidence to reject the null hypothesis. This suggests that there is significant evidence of heteroscedasticity in the residuals of the model. To deal with heteroscedasticity, we can apply data transformation or weighted least squares.
Logarithmic transformation does not work effectively to correct normality and heteroscedasticity. Thus, we will try another approach to fix normality.
To address the issues identified in the residual analysis, we can try using the box-cox function from the MASS package to perform the Box-Cox transformation. The Box-Cox transformation helps identify the optimal power transformation that maximizes the likelihood of the data being normally distributed.
# Perform the Box-Cox transformation
transformed_data <- boxcox(Price..in.USD. ~ YearsSinceProduce + Horsepower + Engine.Size..L. + X0.60.MPH.Time..seconds., data = data_clean2)
# Extract the optimal lambda value
lambda <- transformed_data$x[which.max(transformed_data$y)]
# Apply the Box-Cox transform to the response variable
data_clean2$Price_transformed <- ((data_clean2$Price..in.USD.^lambda) - 1) / lambda
lambda
## [1] -0.1818182
Lamba is found to be -0.1818182.
# Plot histogram of BC transformed car price
ggplot(data_clean2, aes(x = Price_transformed)) +
geom_histogram(binwidth = 0.01, color = "blue", fill = "blue") +
labs(title = "Histogram of BC Transformed Sports Car Price", x = "BC Transformed Sports Car Price", y = "Frequency")
The histogram of Sports Car Price has greatly improved from the original histogram after Box-Cox transformation. It looks more normally distributed than before.
Next we fit the model with the transformed response variable.
model_full_transformed <- lm(Price_transformed ~ YearsSinceProduce + Horsepower + Torque..lb.ft. + Engine.Size..L. + X0.60.MPH.Time..seconds., data=data_clean2)
summary(model_full_transformed)
##
## Call:
## lm(formula = Price_transformed ~ YearsSinceProduce + Horsepower +
## Torque..lb.ft. + Engine.Size..L. + X0.60.MPH.Time..seconds.,
## data = data_clean2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.15237 -0.03854 0.00202 0.03367 0.22591
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.876e+00 2.107e-02 231.392 < 2e-16 ***
## YearsSinceProduce 6.607e-03 9.667e-04 6.835 1.47e-11 ***
## Horsepower 4.093e-04 2.432e-05 16.834 < 2e-16 ***
## Torque..lb.ft. -1.369e-04 2.879e-05 -4.754 2.30e-06 ***
## Engine.Size..L. -1.069e-02 1.838e-03 -5.815 8.30e-09 ***
## X0.60.MPH.Time..seconds. -4.350e-02 4.073e-03 -10.680 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06024 on 940 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.6966, Adjusted R-squared: 0.6949
## F-statistic: 431.6 on 5 and 940 DF, p-value: < 2.2e-16
All predictor variables are significant with a p-value lower than 0.05. The overall model is significant.
We will perform residual analysis again with the transformed data.
# Residual Plots
par(mfrow=c(2,2))
plot(model_full_transformed)
The Residuals vs Leverage plot helps us find influential cases using Cook’s distance. There is still one outlying value at the bottom right corner.
Residuals vs Fitted plot shows if residuals have non-linear patterns. Most residuals are more or less spreading around the horizontal line. It can be an indication that we do not have non-linear relationships, but we will need to check the partial residual plots.
crPlots(model_full_transformed)
The pink line isfollowing the blue line, which indicates that it is likely that there are no violations of the linearity assumption in this model.
We check if the residuals have any autocorrelation by ACF and PACF plots.
par(mfrow=c(1,2))
acf(model_full_transformed$residuals, main = "The ACF of the residuals")
pacf(model_full_transformed$residuals, main = "The PACF of the residuals")
There are still 1-3 significant lags left in ACF and PACF plots of residuals. Autocorrelation in the residuals might indicate that our model is missing relevant predictor variables or that there are unaccounted systematic patterns in the data. To check statistically, we will use Durbin-Watson test.
Null Hypothesis (\(H_0\)): There is no autocorrelation present in the residuals.
Alternative Hypothesis (\(H_A\)): There is autocorrelation present in the residuals.
durbinWatsonTest(model_full_transformed)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.003145495 1.992887 0.914
## Alternative hypothesis: rho != 0
Based on the Durbin-Watson test result, the p-value is \(0.888\). Since the p-value is larger than \(0.05\), we do not have sufficient evidence to reject the null hypothesis. This suggests that there is no autocorrelation present in the residuals of the model.
The QQ plot is a valuable tool for assessing the normality of a dataset. It compares the observed quantiles of the data to the expected quantiles of a normal distribution. If the points on the plot closely follow a straight line, it suggests that the data follows a normal distribution.
qqnorm(model_full_transformed$residuals, xlab="Normal Scores")
qqline(model_full_transformed$residuals)
There are still positive tails deviated from the line. The Box-Cox transformed data is still skewed to the right.
Null Hypothesis (\(H_0\)): Errors are normally distributed.
Alternative Hypothesis (\(H_A\)): Errors are not normally distributed.
shapiro.test(model_full_transformed$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_full_transformed$residuals
## W = 0.97782, p-value = 8.115e-11
Based on the Shapiro-Wilk test result, although there is improvement in the normality in terms of p-value, the p-value is still way smaller than \(0.05\), we have sufficient evidence to reject the null hypothesis. This suggests that errors are not normally distributed.
The ncvTest, or non-constant variance test, is used to assess whether there is evidence of heteroscedasticity in a regression model. A significant result from the ncvTest indicates the presence of heteroscedasticity, suggesting that the variance of the errors is not constant across different levels of the predictors.
Null Hypothesis (\(H_0\)): Errors have a constant variance.
Alternative Hypothesis (\(H_A\)): Errors have a non-constant variance.
ncvTest(model_full_transformed)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.121874, Df = 1, p = 0.72701
From ncvTest, the p-value is greater than \(0.05\), we do not have evidence to reject the null hypothesis. This suggests that errors have a non-constant variance. This implies that constant error variance assumption is not violated.
We will continue our analyses using the Box-Cox transformed data because error variance assumption becomes not violated. However, we should be aware that residuals are not normal.
In this section, a test-train split is performed to partition the dataset into training and testing subsets. The aim is to use the training subset to build a predictive model and evaluate its performance on the independent testing subset. The split is done using a 70-30 ratio, where 70% of the data is allocated to the training set and the remaining 30% is assigned to the test set. To ensure reproducibility, a random seed of 999 is set before sampling the data. The resulting training set (train) will be used for model development, while the test set (test) will be used for evaluating the model’s performance.
# 70% as training
sample_size = floor(0.7 * nrow(data_clean2))
# Set the seed to make partition reproducible
set.seed(999)
train_set <- sample(seq_len(nrow(data_clean2)), size = sample_size)
train <- data_clean2[train_set,]
test <- data_clean2[-train_set,]
In this section, we will perform all possible regressions and then select the best model from the candidates models.
Firstly, we fit a full multiple linear regression model using the lm() function using Car Age, Horsepower, Torque, Engine Size, MPH time as predictors in the dataset. Box-Cox transformed price is the response variable. Then we will make use of features selection methods to select features that best explains the variation in sport car prices. Three candidate models are considered and the performance and validity of results of each model is evaluated using significance tests and model adequacy tests.
Our first model is built using the full set of predictor variables in the data.
# Model 1 - Full Model using train data
model1 <- lm(Price_transformed ~ YearsSinceProduce + Horsepower + Torque..lb.ft. + Engine.Size..L. + X0.60.MPH.Time..seconds., data=train)
summary(model1)
##
## Call:
## lm(formula = Price_transformed ~ YearsSinceProduce + Horsepower +
## Torque..lb.ft. + Engine.Size..L. + X0.60.MPH.Time..seconds.,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.147035 -0.037615 0.000925 0.036125 0.228707
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.875e+00 2.602e-02 187.326 < 2e-16 ***
## YearsSinceProduce 9.820e-03 2.611e-03 3.761 0.000185 ***
## Horsepower 4.128e-04 2.874e-05 14.365 < 2e-16 ***
## Torque..lb.ft. -1.361e-04 3.330e-05 -4.086 4.93e-05 ***
## Engine.Size..L. -1.264e-02 2.280e-03 -5.544 4.29e-08 ***
## X0.60.MPH.Time..seconds. -4.433e-02 4.864e-03 -9.113 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05897 on 655 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.7, Adjusted R-squared: 0.6977
## F-statistic: 305.6 on 5 and 655 DF, p-value: < 2.2e-16
The fitted multiple linear regression model is \(\hat Y= 4.875e+00 - 1.264e-02 \hspace{0.1 cm} EngineSize + 4.128e-04 \hspace{0.1 cm} Horsepower -1.361e-04 \hspace{0.1 cm} Torque -4.433e-02 \hspace{0.1 cm} MPH time + 9.820e-03 \hspace{0.1 cm} YearsSinceProduce + \varepsilon\) where the independent error terms \(\varepsilon\) is assumed to follow a normal distribution with mean 0 and equal variance. Y is dependent or response variable. \(EngineSize\), \(Horsepower\), \(Torque\), \(MPH time\) and \(YearsSinceProduce\) are independent variables.
All predictor variables are significant at 5% level of significance. The overall model has a p-value less than \(0.05\). This suggests strong evidence that all predictors in the model are significantly associated with the response variable. This model has a adjusted R-square of \(0.6977\), meaning that 69.77% of the variability in the response variable (sports car price) can be explained by the predictor variables (car age, engine size, torque, horsepower, and MPH time). The adjusted R-squared takes into account the number of predictors in the model and adjusts the R-squared value accordingly.
We could use ANOVA (Analysis of Variance) to test the significance of model 1.
\(H_0\): The fit of intercept only model and the current model is the same, meaning that additional variables do not provide values taken together.
\(H_A\) The fit of intercept only model is significantly less when compared to our current model.
summary_1 <- ols_regress(Price_transformed ~ YearsSinceProduce + Horsepower + Torque..lb.ft.+ Engine.Size..L. + X0.60.MPH.Time..seconds., data = train)
summary_1
## Model Summary
## -------------------------------------------------------------
## R 0.837 RMSE 0.059
## R-Squared 0.700 Coef. Var 1.213
## Adj. R-Squared 0.698 MSE 0.003
## Pred R-Squared 0.694 MAE 0.046
## -------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## ---------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ---------------------------------------------------------------------
## Regression 5.313 5 1.063 305.606 0.0000
## Residual 2.278 655 0.003
## Total 7.591 660
## ---------------------------------------------------------------------
##
## Parameter Estimates
## ------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ------------------------------------------------------------------------------------------------------
## (Intercept) 4.875 0.026 187.326 0.000 4.824 4.926
## YearsSinceProduce 0.010 0.003 0.084 3.761 0.000 0.005 0.015
## Horsepower 0.000 0.000 0.867 14.365 0.000 0.000 0.000
## Torque..lb.ft. 0.000 0.000 -0.226 -4.086 0.000 0.000 0.000
## Engine.Size..L. -0.013 0.002 -0.163 -5.544 0.000 -0.017 -0.008
## X0.60.MPH.Time..seconds. -0.044 0.005 -0.298 -9.113 0.000 -0.054 -0.035
## ------------------------------------------------------------------------------------------------------
The ANOVA table gives us a F-statistics of \(305.606\) and p-value 0.0000, suggesting that the regression is significant at 5% level of significance.
We will check assumption of multiple linear regression for model 1 to confirm validity of the results. Residuals Analysis is important for regression models because it helps to check the assumptions of the model and the quality of the fit.We check the residual plot to assess the normality and constant variance assumptions.
Linearity
plot(model1, which=1)
Residuals vs Fitted plot shows if residuals have non-linear patterns. Most residuals are equally spread around the horizontal line. It is good indication that we do not have non-linear relationships.
One key assumption of multiple linear regression is that there exists a linear relationship between each predictor variable and the response variable. To check this assumption, we create a partial residual plot, which displays the residuals of one predictor variable against the response variable. We can use crPlots() function to create partial residual plots.
crPlots(model1)
The dotted blue line shows the expected residuals if the relationship between the predictor and response variable is linear. The pink line shows the actual residuals. If the two lines differ, it indicates evidence of non-linear relationship. Overall speaking the pink line is following the dotted blue line well.
Normality
QQ plot shows if residuals are normally distributed. Most residuals follow the straight line, but some points deviate from the line at the both end of the data. The QQ-plot suggest that the residuals are violating normality at the tail end of the data.
plot(model1, which=2)
We can also test the assumptions statistically. The Shapiro-Wilk test is a statistical test that checks whether a set of data follows a normal distribution.
Null Hypothesis (\(H_0\)): Errors are normally distributed.
Alternative Hypothesis (\(H_A\)): Errors are not normally distributed.
shapiro.test(model1$residuals)
##
## Shapiro-Wilk normality test
##
## data: model1$residuals
## W = 0.97978, p-value = 6.462e-08
Based on the Shapiro-Wilk test result, the p-value is less than \(0.05\), we reject the null hypothesis. We have sufficient evidence to reject the null hypothesis. This implies that normality error assumption is violated.
Autocorrelation
We use ACF and PACF plots to check if there is autocorrelation left in the residuals.
par(mfrow=c(1,2))
acf(model1$residuals, main = "The ACF of the residuals")
pacf(model1$residuals, main = "The PACF of the residuals")
There is still 1-2 significant lags left, but they are at late lag. Overall, there is not much auto-correlation left in the residuals. We will use Durbin-Watson test to test statistically. It is used to check for the presence of autocorrelation in the residuals of a regression model.The durbinWatsonTest() function takes the regression model and returns the lag one sample autocorrelation coefficient, \(DW\) test statistics and p value.
Null Hypothesis (\(H_0\)): Errors are uncorrelated.
Alternative Hypothesis (\(H_A\)): Errors are correlated.
durbinWatsonTest(model1)
## lag Autocorrelation D-W Statistic p-value
## 1 0.06489718 1.870148 0.086
## Alternative hypothesis: rho != 0
Since the p-value is \(>\) \(0.05\), we do not have enough evidence to reject \(H_0\) that there is no autocorrelation. This implies that uncorrelated error assumption is not violated.
Error Variance
plot(model1, which=3)
Scale-Location plot shows if residuals are spread equally along the ranges of predictors. This is used for checking assumption of equal variance. The line is slightly horizontal with equally spread points. It may indicate homoscedasticity. To test error variance statistically, we can use ncvTest.
Null Hypothesis (\(H_0\)): Errors have a constant variance.
Alternative Hypothesis (\(H_A\)): Errors have a non-constant variance.
ncvTest(model1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.06256627, Df = 1, p = 0.80248
Since the p-value is \(>\) \(0.05\), we cannot reject \(H_0\). This implies that constant error variance assumption is not violated.
Multicollinearity
Variance inflation factor (VIF) is used to check multicollinearity of the model.
vif(model1)
## YearsSinceProduce Horsepower Torque..lb.ft.
## 1.087725 7.949121 6.671375
## Engine.Size..L. X0.60.MPH.Time..seconds.
## 1.876404 2.332417
As we can see, the VIF of Horsepower (7.949121) and Torque (6.671375) are considered significant in multicollinearity. Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated, making it difficult to determine the unique contribution of each variable to the response variable.
Edmondson (2011) stated that torque and horsepower are closely related and cannot be decoupled from each other. Horsepower is the product of torque and a couple of constants. With this strong evidence, we can safely respecify the model and remove one of these two variables from the model. We will remove the one with weaker evidence of association with the response variable.
car::Anova(model1, type="III") # library(car)
Among horsepower and torque variables, we will consider dropping the variable with the higher p-value, which in this case is “Torque..lb.ft.” (p-value = 4.928e-05) based on ANOVA table. Here is our updated Model 2.
# Remove Torque Variable
# Model 2
model2 <- lm(Price_transformed~ YearsSinceProduce + Horsepower + Engine.Size..L. + X0.60.MPH.Time..seconds., data=train)
summary(model2)
##
## Call:
## lm(formula = Price_transformed ~ YearsSinceProduce + Horsepower +
## Engine.Size..L. + X0.60.MPH.Time..seconds., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.178670 -0.036437 0.002223 0.036318 0.213531
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.865e+00 2.621e-02 185.593 < 2e-16 ***
## YearsSinceProduce 1.118e-02 2.621e-03 4.264 2.30e-05 ***
## Horsepower 3.227e-04 1.865e-05 17.307 < 2e-16 ***
## Engine.Size..L. -1.492e-02 2.238e-03 -6.665 5.62e-11 ***
## X0.60.MPH.Time..seconds. -4.427e-02 4.922e-03 -8.994 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05967 on 656 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.6923, Adjusted R-squared: 0.6904
## F-statistic: 369 on 4 and 656 DF, p-value: < 2.2e-16
The fitted multiple linear regression model is \(\hat Y= 4.875e+00 -1.492e-02 \hspace{0.1 cm} EngineSize + 3.227e-04 \hspace{0.1 cm} Horsepower -4.427e-02 \hspace{0.1 cm} MPH time + 1.118e-02 \hspace{0.1 cm} YearsSinceProduce + \varepsilon\) where the independent error terms \(\varepsilon\) is assumed to follow a normal distribution with mean 0 and equal variance. Y is dependent or response variable. \(EngineSize\), \(Horsepower\), \(MPH time\) and \(YearsSinceProduce\) are independent variables.
All predictor variables are significant at 5% level of significance. The overall model has a p-value less than \(0.05\). This suggests strong evidence that all predictors in the model are significantly associated with the response variable. This model has a adjusted R-square of \(0.6904\), meaning that 69.04% of the variability in the response variable (sports car price) can be explained by the predictor variables (car age, engine size, horsepower, and MPH time). The adjusted R-squared takes into account the number of predictors in the model and adjusts the R-squared value accordingly.
Now we will check the VIF to see if we have successfully handled multicollinearity.
vif(lm(Price_transformed~ YearsSinceProduce + Horsepower + Engine.Size..L. + X0.60.MPH.Time..seconds., data=train))
## YearsSinceProduce Horsepower Engine.Size..L.
## 1.070170 3.268073 1.764784
## X0.60.MPH.Time..seconds.
## 2.332399
After removing the torque variable, the VIF of all variables are considered insignificant in multicollinearity. We have addressed the multicollinearity problem.
We could use ANOVA (Analysis of Variance) to test the significance of model 2.
\(H_0\): The fit of intercept only model and the current model is the same, meaning that additional variables do not provide values taken together.
\(H_A\) The fit of intercept only model is significantly less when compared to our current model.
summary_2 <- ols_regress(Price_transformed ~ YearsSinceProduce + Horsepower + Engine.Size..L. + X0.60.MPH.Time..seconds., data=train)
summary_2
## Model Summary
## -------------------------------------------------------------
## R 0.832 RMSE 0.060
## R-Squared 0.692 Coef. Var 1.227
## Adj. R-Squared 0.690 MSE 0.004
## Pred R-Squared 0.688 MAE 0.047
## -------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## ---------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ---------------------------------------------------------------------
## Regression 5.255 4 1.314 369.003 0.0000
## Residual 2.336 656 0.004
## Total 7.591 660
## ---------------------------------------------------------------------
##
## Parameter Estimates
## ------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ------------------------------------------------------------------------------------------------------
## (Intercept) 4.865 0.026 185.593 0.000 4.813 4.916
## YearsSinceProduce 0.011 0.003 0.096 4.264 0.000 0.006 0.016
## Horsepower 0.000 0.000 0.678 17.307 0.000 0.000 0.000
## Engine.Size..L. -0.015 0.002 -0.192 -6.665 0.000 -0.019 -0.011
## X0.60.MPH.Time..seconds. -0.044 0.005 -0.297 -8.994 0.000 -0.054 -0.035
## ------------------------------------------------------------------------------------------------------
The ANOVA table gives us a F-statistics of \(369.003\) and p-value 0.0000, suggesting that the regression is significant at 5% level of significance.
We will check assumption of multiple linear regression for model 2 to confirm validity of the results. Residuals Analysis is important for regression models because it helps to check the assumptions of the model and the quality of the fit.
Linearity
plot(model2, which=1)
Residuals vs Fitted plot shows if residuals have non-linear patterns. Most residuals are equally spread around the horizontal line. It is positive indication that we do not have non-linear relationships.
One key assumption of multiple linear regression is that there exists a linear relationship between each predictor variable and the response variable. To check this linearity assumption, we create a partial residual plot, which displays the residuals of one predictor variable against the response variable. We can use crPlots() function to create partial residual plots.
crPlots(model2)
The dotted blue line shows the expected residuals if the relationship between the predictor and response variable is linear. The pink line shows the actual residuals. If the two lines differ, it indicates evidence of non-linear relationship. We observe that the pink lines are following the dotted blue line well.
Normality
QQ plot shows if residuals are normally distributed. Most residuals follow the straight line, but some points deviate from the line at the both end of the data. The QQ-plot suggest that the residuals are violating normality at the tail end of the data.
plot(model2, which=2)
We can also test the assumptions statistically. The Shapiro-Wilk test is a statistical test that checks whether a set of data follows a normal distribution.
Null Hypothesis (\(H_0\)): Errors are normally distributed.
Alternative Hypothesis (\(H_A\)): Errors are not normally distributed.
shapiro.test(model2$residuals)
##
## Shapiro-Wilk normality test
##
## data: model2$residuals
## W = 0.98607, p-value = 6.129e-06
Based on the Shapiro-Wilk test result, the p-value is less than \(0.05\), we reject the null hypothesis. We have sufficient evidence to reject the null hypothesis. This implies that normality error assumption is violated.
Autocorrelation
ACF and PACF plots are used to check autocorrelation left in the residuals.
par(mfrow=c(1,2))
acf(model2$residuals, main = "The ACF of the residuals")
pacf(model2$residuals, main = "The PACF of the residuals")
There is still 1 to 2 significant lags left in the residuals, but they are at late lag. Overall, there is not much auto-correlation left in the residuals. We will use Durbin-Watson test to test statistically. It is used to check for the presence of autocorrelation in the residuals of a regression model.The durbinWatsonTest() function takes the regression model and returns the lag one sample autocorrelation coefficient, \(DW\) test statistics and p value.
Null Hypothesis (\(H_0\)): Errors are uncorrelated.
Alternative Hypothesis (\(H_A\)): Errors are correlated.
durbinWatsonTest(model2)
## lag Autocorrelation D-W Statistic p-value
## 1 0.07076597 1.858355 0.07
## Alternative hypothesis: rho != 0
Since the p-value is \(>\) \(0.05\), we do not have enough evidence to reject \(H_0\) that there is no autocorrelation. This implies that uncorrelated error assumption is not violated.
Error Variance
plot(model2, which=3)
Scale-Location plot shows if residuals are spread equally along the ranges of predictors. This is used for checking assumption of equal variance. The line is slightly horizontal with equally spread points. It may indicate homoscedasticity. To test error variance statistically, we can use ncvTest.
Null Hypothesis (\(H_0\)): Errors have a constant variance.
Alternative Hypothesis (\(H_A\)): Errors have a non-constant variance.
ncvTest(model2)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.006035501, Df = 1, p = 0.93808
Since the p-value is \(>\) \(0.05\), we cannot reject \(H_0\). This implies that constant error variance assumption is not violated.
Stepwise elimination is used to select the variables that gives the best Akaike Information Criterion (AIC) value. Both ols_step_both_aic() and step() methods can be used to perform stepwise regression
ols_step_both_aic() performs a stepwise regression using both forward and backward selection based on the AIC criterion. It starts with an empty model and iteratively adds or removes variables until no further improvement in the AIC value is achieved. The final model selected is the one with the lowest AIC. Based on ols_step_both_aic() \(AIC\)-value approach, the variables that should be entered are all four predictor variables.
ols_step_both_aic(model1)
##
##
## Stepwise Summary
## --------------------------------------------------------------------------------------------
## Variable Method AIC RSS Sum Sq R-Sq Adj. R-Sq
## --------------------------------------------------------------------------------------------
## Horsepower addition -1705.933 2.904 4.687 0.61750 0.61692
## X0.60.MPH.Time..seconds. addition -1797.745 2.519 5.072 0.66811 0.66710
## Engine.Size..L. addition -1827.724 2.400 5.191 0.68378 0.68234
## Torque..lb.ft. addition -1846.314 2.327 5.264 0.69348 0.69161
## YearsSinceProduce addition -1858.435 2.278 5.313 0.69996 0.69767
## --------------------------------------------------------------------------------------------
It suggests that all five predictor variables should be included based on AIC approach.
Next we will try all-possible regression approach. We use \(adjr2\) as our model selector criterion.
all_possible_reg <-leaps::regsubsets(Price_transformed ~ YearsSinceProduce + Horsepower + Torque..lb.ft.+ Engine.Size..L. + X0.60.MPH.Time..seconds., data=train)
summary(all_possible_reg)
## Subset selection object
## Call: regsubsets.formula(Price_transformed ~ YearsSinceProduce + Horsepower +
## Torque..lb.ft. + Engine.Size..L. + X0.60.MPH.Time..seconds.,
## data = train)
## 5 Variables (and intercept)
## Forced in Forced out
## YearsSinceProduce FALSE FALSE
## Horsepower FALSE FALSE
## Torque..lb.ft. FALSE FALSE
## Engine.Size..L. FALSE FALSE
## X0.60.MPH.Time..seconds. FALSE FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: exhaustive
## YearsSinceProduce Horsepower Torque..lb.ft. Engine.Size..L.
## 1 ( 1 ) " " "*" " " " "
## 2 ( 1 ) " " "*" " " " "
## 3 ( 1 ) " " "*" " " "*"
## 4 ( 1 ) " " "*" "*" "*"
## 5 ( 1 ) "*" "*" "*" "*"
## X0.60.MPH.Time..seconds.
## 1 ( 1 ) " "
## 2 ( 1 ) "*"
## 3 ( 1 ) "*"
## 4 ( 1 ) "*"
## 5 ( 1 ) "*"
The resulting table provides information about the different models generated by the All Possible Subsets regression. It gives us the best models at each variable number. We can use graphical table to visualize the best subsets.
plot(all_possible_reg, scale="adjr2")
Based on the graph above, by adjusted \(R^2\), the best model includes all predictor variables (variables that have black boxes at the highest adjusted \(R^2\) value).
As the all-possible and stepwise regression approach consistently suggest that the full model (Model 1) is the best, it indicates that all the included variables are important for explaining the response variable. As a result, we do not have any new candidate model arise in this step.
For model 3, we explore other distributions that are appropriate for our data. As our price data follows a right skewed distribution, we can try fitting a GLM regression model to the data using gamma distribution.
In the context of Generalized Linear Models (GLMs), the gamma distribution is a suitable choice for the response variable when certain characteristics are present. Specifically, it is commonly used when the dependent variable is continuous, positively skewed, and its mean is related to the predictor variables through a log link function. The gamma distribution accommodates situations where the relationship between the predictors and the response is expected to be multiplicative rather than additive. By modeling the data using the gamma distribution and a log link, the GLM can capture these characteristics and provide appropriate inference and interpretation of the results.
# Use original Price data
glm.fit <- glm(Price..in.USD. ~ YearsSinceProduce + Horsepower + Engine.Size..L. + X0.60.MPH.Time..seconds.,
family = Gamma(link="log"), data=train)
summary.glm(glm.fit)
##
## Call:
## glm(formula = Price..in.USD. ~ YearsSinceProduce + Horsepower +
## Engine.Size..L. + X0.60.MPH.Time..seconds., family = Gamma(link = "log"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6104 -0.4182 -0.1148 0.1289 3.3147
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.4442000 0.3624293 31.576 < 2e-16 ***
## YearsSinceProduce 0.0559944 0.0362379 1.545 0.123
## Horsepower 0.0032196 0.0002578 12.488 < 2e-16 ***
## Engine.Size..L. -0.0283972 0.0309434 -0.918 0.359
## X0.60.MPH.Time..seconds. -0.3539309 0.0680579 -5.200 2.66e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.6807261)
##
## Null deviance: 979.03 on 660 degrees of freedom
## Residual deviance: 226.43 on 656 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 17005
##
## Number of Fisher Scoring iterations: 9
Only horsepower and MPH time predictor variables are significant with p-value less than 0.05. We will check if the model deviance indicates that the GLM model is satisfactory.
\(H_0\): The model fits the data well. \(H_A\): The model does not fit the data well.
deviance(glm.fit)
## [1] 226.43
# p value for the goodness of fit test
pchisq(glm.fit$deviance, df=glm.fit$df.residual, lower.tail=FALSE)
## [1] 1
The deviance() function is used to calculate the deviance of a fitted generalized linear model (glm). The deviance is a measure of the lack of fit of the model to the observed data. A lower deviance value indicates a better fit of the model to the data, suggesting that the model explains a larger proportion of the variability in the response variable.
The pchisq() function is then applied to obtain the p-value for the goodness of fit test using the deviance and the residual degrees of freedom. The chi-square test statistic of \(226.43\) gives a p-value of \(1\), indicating that the null hypothesis is plausible, and we can conclude that the GLM model provides an adequate fit to the data.
Next we perform a type 3 partial deviance analysis of the model parameters, to see if we should remove any regressors from the model.
Anova(glm.fit,type = 3)
It indicates that Engine Size and YearsSinceProduce variables should be removed, as their p-values are greater than 0.05.
We will check assumption of multiple linear regression for GLM model to confirm validity of the results. Residuals Analysis is important for regression models because it helps to check the assumptions of the model and the quality of the fit.
# residuals vs fitted
plot(glm.fit, which=1)
Residuals vs Fitted plot shows if residuals have non-linear patterns. Most of the residuals are equally spread around the horizontal line. It is good indication that we do not have non-linear relationships.
Linearity
To check linear relationship assumption, we create a partial residual plot, which displays the residuals of one predictor variable against the response variable. We can use crPlots() function to create partial residual plots.
crPlots(glm.fit)
The dotted blue line shows the expected residuals if the relationship between the predictor and response variable is linear. The pink line shows the actual residuals. If the two lines differ, it indicates evidence of non-linear relationship. The pink line is following the dotted blue line closely. It is good indication that there is no non-linear relationship.
Normality
QQ plot shows if residuals are normally distributed. Most residuals follow the straight line, but some points deviate from the line at the both end of the data. The QQ-plot clearly suggests that the residuals are violating normality at the tail end of the data.
plot(glm.fit, which=2)
There are a lot of points deviated from the straight line. We can also test the assumptions statistically. The Shapiro-Wilk test is a statistical test that checks whether a set of data follows a normal distribution.
Null Hypothesis (\(H_0\)): Errors are normally distributed.
Alternative Hypothesis (\(H_A\)): Errors are not normally distributed.
res_dev = residuals(glm.fit, type="deviance")
shapiro.test(res_dev)
##
## Shapiro-Wilk normality test
##
## data: res_dev
## W = 0.87885, p-value < 2.2e-16
Based on the Shapiro-Wilk test result, the p-value is less than \(0.05\), we reject the null hypothesis. We have sufficient evidence to reject the null hypothesis. This implies that normality error assumption is violated.
Autocorrelation
par(mfrow=c(1,2))
acf(res_dev, main = "The ACF of the residuals")
pacf(res_dev, main = "The PACF of the residuals")
Similar to model 1 and 2, there is 1-2 significant late lags. There is not much auto-correlation left in the residuals. We will use Durbin-Watson test to test statistically.
Null Hypothesis (\(H_0\)): Errors are uncorrelated.
Alternative Hypothesis (\(H_A\)): Errors are correlated.
durbinWatsonTest(glm.fit)
## lag Autocorrelation D-W Statistic p-value
## 1 0.07853834 1.842809 0
## Alternative hypothesis: rho != 0
The p-value is \(<\) \(0.05\), we have enough evidence to reject \(H_0\) that there is no autocorrelation left in the residuals. This implies that uncorrelated error assumption is violated.
Error Variance
plot(glm.fit, which=3)
Scale-Location plot shows if residuals are spread equally along the ranges of predictors. This is used for checking assumption of equal variance. The line is slightly horizontal with equally spread points. It may indicate homoscedasticity.
Multicollinearity
Variance inflation factor (VIF) is used to check multicollinearity of the model.
vif(glm.fit)
## YearsSinceProduce Horsepower Engine.Size..L.
## 1.070170 3.268073 1.764784
## X0.60.MPH.Time..seconds.
## 2.332399
There is no multicollinearity in the model. They all have VIF equal or less than 3.268073.
Anova results show that all the models are significant at 5% level of significance. Although we have performed Box-Cox transformation, all three models fails to meet the normality assumption.
Model 1 and Model 2 are chosen over GLM Model since Model 1 and Model 2 show better residual characteristics. Uncorrelated error assumption is violated in GLM Model.
Also, only two variables are significant in GLM Model, but all predictor variables are significant in the Model 1 and Model 2.
Model 1 VS Model 2
To compare two linear regression models and determine the better fit, we will assess their adjusted R-squared, F-statistic, RMSE, ANOVA, and press statistics. These metrics allow us to evaluate the models’ explanatory power, overall significance, predictive accuracy, and ability to explain variation in the response variable. By considering these measures, we can make an informed comparison and select the model that best fits the data.
# Create table to compare Model 1 and Model 2
Metrics <- c("Number of variables", "Multiple R-squared", "Adjusted R-squared" ,
"F-statistic","RMSE")
model_1_table <- c(round(summary_1$n-1,0), summary_1$rsq, summary_1$adjr, summary_1$f, sqrt(summary_1$mse))
model_2_table <- c(round(summary_2$n-1,0), summary_2$rsq, summary_2$adjr , summary_2$f , sqrt(summary_2$mse))
table <- data.frame(Metrics, model_1_table, model_2_table , stringsAsFactors=F)
colnames(table) <- c("Metrics", "Model 1", "Model 2")
table
Both Model 1 and Model 2 can explain approximately 69% of the variation in Sports Car Price, as shown from Adjusted R-squared value. Model 1 is able to explain 69.77% of variation in sports car price with 5 variables and Model 2 is explaining almost 69.04% of variation with 4 variables.
Model 1 has an F-statistic of 305, while Model 2 has a higher F-statistic of 369 The F-statistic measures the overall significance of the model. A higher F-statistic indicates a better overall fit of the model to the data.
Model 1 has an RMSE (Root Mean Squared Error) of 0.059, while Model 2 has a slightly higher RMSE of 0.060 The RMSE represents the average difference between the predicted values and the actual values, with lower values indicating better predictive performance.
ANOVA
The ANOVA test is used to compare the goodness-of-fit between two models. The resulting p-value helps determine whether there is sufficient evidence to reject the null hypothesis and conclude that there is a significant difference in the fit between the models.
Null Hypothesis(\(H_0\)): There is no significant difference in the fit between Model 1 and Model 2.
Alternative Hypothesis(\(H_A\)): There is a significant difference in the fit between Model 1 and Model 2.
Comp_pvalue <- anova(model1, model2)
Comp_pvalue
The anova gives a p-value of \(4.928e-05\) lower than 0.05, suggesting that the two models are significantly different. In the context of model comparison, a small p-value suggests that the additional predictors included in the model compared to the other significantly improve the fit of the model. It indicates that the variables included in the more complex model (Model 1) contributes significantly to explaining the variation in the response variable.
The model with the lowest values of PRESS indicates the best structure. The “Press” statistic is a measure of predictive accuracy for a regression model. It quantifies the sum of squared differences between the predicted values and the actual values for each observation, with one observation excluded at a time.
# Press statistics for Model 1
DAAG::press(model1)
## [1] 2.319341
# Press statistics for Model 2
DAAG::press(model2)
## [1] 2.368002
Model 1 has a Press statistic of 2.319341, while Model 2 has a Press statistic of 2.368002. The lower the Press statistic, the better the predictive accuracy of the model. Therefore, Model 1 has a slightly better predictive accuracy than Model 2, as it has a lower Press statistic, which means that including the variable Torque to the model is significantly recommended.
However. we should consider the fact that Model 1 has Multicollinearity issue supported by Variance inflation factor (VIF). Horsepower and Torque are highly correlated, making it difficult to determine the unique contribution of each variable to the response variable. After removal of Torque variable, Model 2 does not have Multicollinearity issue. Therefore, even though Model 1 has lower Press statistics and has significant difference in the fit, we consider using Model 2 in this report as a better fit for the sports car price data. Model 2 also has a higher F-statistic, indicating a better fit of the model.
Model 2 will be used for explaining the variations in sports car price and for predicting the sports car price in test data.
The coefficients of the Model 2 are as follows.
# coefficients of Model 2
model2$coefficients
## (Intercept) YearsSinceProduce Horsepower
## 4.8647055406 0.0111753198 0.0003226898
## Engine.Size..L. X0.60.MPH.Time..seconds.
## -0.0149153313 -0.0442702839
It is evident from the coefficients of model 2 that:
Since we have removed Torque variable, the relationship between price and torque is not testified. The rest of the assumptions regarding the model are verified and model 2 is then used to predict sports car price using test data.
# Make prediction using model 2
pred <- as.data.frame(predict(model2, test[,c(1,2,4,6)],interval = "prediction"))
# Add the actual values from the test dataset to the 'pred' data frame
pred$actual_value <- test$Price_transformed
# Display the predicted and actual values
head(pred)
Mean Absolute Scaled Error (MASE) is less sensitive to outliers when compared to other error metrics like Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE). It calculates the errors as a ratio, which reduces the influence of extreme values and makes it a robust measure of forecast accuracy.
# Calculate MASE using customized function
MASE(pred$fit, pred$actual_value)
## $MASE
## MASE
## 1 0.5205948
The MASE of the prediction using Model 2 is good with a value less than 1. Hence, the model can be regarded as sufficient in answering our research questions and predicting sports car prices.
accuracy_data <- accuracy(pred$fit, pred$actual_value)
accuracy_data
## ME RMSE MAE MPE MAPE
## Test set 0.004773298 0.06576629 0.05050642 0.0825648 1.035433
The accuracy_data object contains several metrics that assess the accuracy of the predicted values compared to the actual values in the test set. For example, RMSE (Root Mean Squared Error) is the square root of the mean squared differences between the predicted and actual values, which measures the average magnitude of the prediction errors. MAE (Mean Absolute Error) is the mean absolute difference between the predicted and actual values, which measures the average absolute magnitude of the prediction errors.
In our case, the accuracy metrics indicate that the model’s predictions have a small mean error (ME), low root mean squared error (RMSE), and mean absolute error (MAE). The mean percentage error (MPE) and mean absolute percentage error (MAPE) are also within acceptable ranges, indicating reasonable accuracy of the model’s predictions on the test set.
In this study, we aim to develop a regression model to predict the price of conventional sports cars (excluding Electric or Hybrid sports car) based on their functionality and technical specifications. Our analysis included numerical features such as engine size, horsepower, torque, 0-60 MPH time and years since production.
To address issues of non-normality, heteroscedasticity in the data, we performed log and Boxcox transformation. The Boxcox transformation yielded the best results, allowing us to proceed with model fitting. We split the dataset into training and testing sets to evaluate the performance of our models. We considered three regression models: the full model (model 1), a model without the Torque variable to address multicollinearity (model 2), and a generalized linear model (GLM).
Upon analyzing the models, we found that the GLM violated the assumption of uncorrelated errors, leading us to exclude it from further consideration. Between model 1 and model 2, although model 1 had a slightly higher adjusted R-squared, slightly lower RMSE, and lower Press statistic, model 2 emerged as the preferred choice due to its ability to address multicollinearity issues and higher F-statistic.
The coefficients of model 2 provided valuable insights into the relationship between the predictor variables and the price of sports cars. We observed that engine size and 0-60 MPH time were inversely proportional to the price, while horsepower had a direct relationship with the price. Using model 2, we made predictions of sports car prices on a test dataset. The Mean Absolute Scaled Error (MASE) for the predictions was less than 1, indicating that the model performed well in estimating the prices of sports cars.
Overall, our study demonstrates that considering the functionality and technical specifications of conventional sports cars can provide valuable insights into their pricing dynamics. However, it is important to acknowledge that the study focused solely on conventional sports cars and did not encompass electric or hybrid cars. Future research can explore the pricing dynamics of electric and hybrid sports cars separately to gain a comprehensive understanding of the entire sports car market.
Edmondson, C. (2011). Torque or Horsepower? Finding the Shift Points. In Fast Car Physics. Johns Hopkins University Press.
McKinsey & Company. (2022). Five Trends Shaping Tomorrow’s Luxury Car Market. Retrieved 1 Jun, 2023 from https://www.mckinsey.com/industries/automotive-and-assembly/our-insights/five-trends-shaping-tomorrows-luxury-car-market
Rattanaporn, K. (2023). Sports Car Prices dataset (Kaggle). Retrieved May 15, 2023 from https://www.kaggle.com/datasets/rkiattisak/sports-car-prices-dataset